Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find multiple review articles summarizing progress in the field [@subramanian2020; @graw2021; @heo2021; @picard2021; @reel2021; @vlachavas2021; @adossa2021]. As with other applications of machine learning, the kinds of problems addressed by these algorithms are divided into two categories: unsupervised (e.g., clustering or class discovery) or supervised (including class comparison and class prediction) [@simon2003]. Advances in the area of unsupervised learning have been broader and deeper than advances in supervised learning.
One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [@argelaguet2018; @argelaguet2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.
MOFA starts with a standard method – Latent Factor Analysis – that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all of the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that, at least in some cases, it is approximately equivalent to a two-step process.
That re-interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call “plasma”, to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. We use partial least squares (PLS) for both steps, using Cox regression to learn the single omics models and linear regression to learn how to extend models from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).
Our computational method is implemented and the data are available in
the plasma package.
suppressWarnings( library(plasma) )
packageVersion("plasma")
## [1] '0.7.6'
The results included here are in whole or part based upon data generated by the TCGA Research Network. We downloaded the entire esophageal cancer Level 3 data set [@tcga2017] from the Genomics Data Commons (GDC) [@jensen2017] on 6 August 2018. We filtered the data sets so that only the most variable, and presumably the most informative, features were retained. Here, we load this sample data set.
loadESCAdata()
sapply(assemble, dim)
## ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,] 53 6 566 1454 926 2520 192
## [2,] 185 185 184 185 166 157 126
Finally, in order to be able to illustrate the ability of the plasma algorithm to work in the presence of missing data, we randomly selected 10% of the patients to remove from the miRSeq data set (leaving 166 patients) and 15% of the patients to remove from the mRNASeq data set (leaving 157 patients). We provide a summary of the outcome data below.
We recommend imputing small amounts of missing data in the input data sets. The underlying issue is that the PLS models we use for individual omics data sets will not be able to make predictions on a sample if even one data point is missing. As a result, if a sample is missing at least one data point in every omics data set, then it will be impossible to use that sample at all.
For a range of available methods and R packages, consult the CRAN Task View on
Missing Data. We also recommend the R-miss-tastic web site on
missing data. Their simulations suggest that, for purposes of
producing predictive models from omics data, the imputation method is
not particularly important. Because of the latter finding, we have only
implemented two simple imputation methods in the plasma
package:
- meanModeImputer will replace any missing data by the mean value of the observed data if there are more than five distinct values; otherwise, it will replace missing data by the mode. This approach works relatively well for both continuous data and for binary or small categorical data.
- samplingImputer replaces missing values by sampling randomly from the observed data distribution.

set.seed(54321)
imputed <- lapply(assemble, samplingImputer)
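The rule that meanModeImputer applies can be sketched in a few lines of base R. The toy function below (imputeVector, our own name, not part of the plasma package) is only an illustration of the stated rule:

```r
# Toy re-implementation (ours, not the package's) of the imputation rule
# described above: impute the mean when a feature looks continuous (more
# than five distinct observed values), otherwise impute the mode.
imputeVector <- function(x) {
  obs <- x[!is.na(x)]
  if (length(unique(obs)) > 5) {
    fill <- mean(obs)                                # continuous: mean
  } else {
    tab <- table(obs)
    fill <- as.numeric(names(tab)[which.max(tab)])   # categorical: mode
  }
  x[is.na(x)] <- fill
  x
}
imputeVector(c(1, 2, NA, 2, 2))   # few distinct values, so the mode (2) fills the NA
```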
The plasma algorithm is based on Partial Least Squares
(PLS), which has been shown to be an effective method for finding
components that can predict clinically interesting outcomes [@bastien2015]. The workflow of the plasma
algorithm is illustrated in Figure 1 in the case of
three omics data sets. First, for each of the omics data sets, we apply
the PLS Cox regression algorithm (plsRcox Version 1.7.6
[@bertrand2021]) to the time-to-event
outcome data to learn three separate predictive models (indicated in
red, green, and blue, respectively). Each of these models may be
incomplete, since they are not defined for patients who have not been
assayed (shown in white) using that particular omics technology. Second,
for each pair of omics data sets, we apply the PLS linear regression
algorithm (pls Version 2.8.1 [@mishra2022]) to learn how to predict the
coefficients of the Cox regression components from one data set using
features from the other data set. This step extends (shown in pastel
red, green, and blue, respectively) each of the original models, in different
ways, from the intersection of samples assayed on both data sets to
their union. Third, we average all of the different extended models
(ignoring missing data) to get a single coherent model of component
coefficients across all omics data sets. Assuming that this process has
been applied to learn the model from a training data set, we can
evaluate the final Cox regression model on both the training set and a
test set of patient samples.
Figure 1 : Workflow schematic for plasma algorithm with three omics data sets. See main text for an explanation.
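The third (averaging) step of the workflow can be illustrated on toy numbers. The matrix below is a made-up stand-in for the extended component predictions, not output from the package:

```r
# Each column holds one extended model's predicted component score per
# patient, with NA where that model could not be extended to a patient.
pred <- cbind(fromA = c(0.5,  NA, 1.0),
              fromB = c(0.7, 0.4,  NA),
              fromC = c( NA, 0.6, 1.2))
# Averaging while ignoring missing entries yields one coherent score per
# patient, as in the final step of the workflow.
rowMeans(pred, na.rm = TRUE)
```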
All computations were performed in R version 4.2.1 (2022-06-23 ucrt)
of the R Statistical Software Environment [@Rbook]. Cox proportional hazards models for
survival analysis were fit using version 3.3.1 of the
survival R package. We used additional exploratory
graphical tools from version 1.3.1 of the beanplot R
package [@kampstra2008] and version 1.5.1
of the Polychrome R package [@coombes2019].
Because of the layered nature of the plasma algorithm, we intend to use the following terminology to help clarify the later discussions.
To be consistent with the MOFA2 R package [@argelaguet2020], all of the data sets are
arranged so that patient samples are columns and assay features are
rows. Our first task is to pad each data set with appropriate
NA’s to ensure that each set includes the same patient
samples in the same order, where that order matches the outcome data
frame.
MO <- prepareMultiOmics(imputed, Outcome)
summary(MO)
## Datasets:
## ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,] 53 6 566 1454 926 2520 192
## [2,] 185 185 185 185 185 185 185
## Outcomes:
## patient_id vital_status days_to_death days_to_last_followup Days
## a3i8 : 1 alive:108 Min. : 9.0 Min. : 4.0 Min. : 4.0
## a3ql : 1 dead : 77 1st Qu.: 180.0 1st Qu.: 336.5 1st Qu.: 232.0
## a3y9 : 1 Median : 351.0 Median : 402.5 Median : 400.0
## a3ya : 1 Mean : 495.2 Mean : 570.1 Mean : 538.9
## a3yb : 1 3rd Qu.: 650.0 3rd Qu.: 696.8 3rd Qu.: 681.0
## a3yc : 1 Max. :2532.0 Max. :3714.0 Max. :3714.0
## (Other):179 NA's :108 NA's :77
We see that the number of patients in each data set is now equal to the number of patients with clinical outcome data.
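The padding idea behind prepareMultiOmics can be sketched on a toy assay matrix (the sample names here are hypothetical):

```r
# Align a toy assay (samples in columns) to a master sample list,
# inserting all-NA columns for samples that were never assayed.
samples <- c("s1", "s2", "s3", "s4")
assay <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("s1", "s3")))
padded <- assay[, match(samples, colnames(assay))]   # NA index -> NA column
colnames(padded) <- samples
padded   # columns s2 and s4 are entirely NA
```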
As indicated above, we want to separate the data set into training and test samples. We will use 60% for training and 40% for testing.
set.seed(54321)
splitVec <- rbinom(nrow(Outcome), 1, 0.6)
Figure 2 presents a graphical overview of the number
of samples (N) and the number of features (D)
in each omics component of the training and test sets.
trainD <- MO[, splitVec == 1]
testD <- MO[, splitVec == 0]
opar <- par(mai = c(1.02, 1.32, 0.82, 0.22), mfrow = c(1,2))
plot(trainD, main = "Train")
plot(testD, main = "Test")
Figure 2 : Overview of training and test data.
par(opar)
The first step of the plasma algorithm is to fit PLS Cox
models on each omics data set using the function
fitCoxModels. The returned object of class
MultiplePLSCoxModels contains a list of
SingleModel objects, one for each assay, and within each
there are three regression models:
- plsRcoxmodel contains the coefficients of the components learned by PLS Cox regression. The number of components is determined automatically as a function of the logarithm of the number of features in the omics data set. The output of this model is a continuous prediction of “risk” for the time-to-event outcome of interest.
- riskModel is a coxph model using the continuous predicted risk as a single predictor.
- splitModel is a coxph model using a binary split of the risk (at the median) as the predictor.

suppressWarnings( firstPass <- fitCoxModels(trainD, timevar = "Days",
eventvar = "vital_status", eventvalue = "dead") )
summary(firstPass)
## An object containing MultiplePLSCoxModels based on:
## [1] "ClinicalBin" "ClinicalCont" "MAF" "Meth450" "miRSeq" "mRNASeq"
## [7] "RPPA"
if (!interactive()) {
plot(firstPass, legloc = "bottomleft") # margins too small inside RStudio window
}
Figure 3 : Kaplan-Meier plots of overall survival on the training set from separate PLS Cox omics models
On the training set, each of the seven contributing omics data sets is able to find a PLS model that can successfully separate high risk from low risk patients (Figure 3).
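The median split used by splitModel can be mimicked on toy risk scores (the numbers below are invented for illustration):

```r
# Dichotomize a continuous risk score at its median, as splitModel does
# before fitting its Cox model on the binary group.
risk <- c(0.2, 1.4, 0.9, 2.1, 0.5)
riskGroup <- factor(risk > median(risk), levels = c(FALSE, TRUE),
                    labels = c("low", "high"))
table(riskGroup)   # the median value itself falls in the "low" group
```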
The second step of the algorithm is to extend the individual
omics-based models across other omics data sets. This step is performed
using the plasma function, which takes in the previously
created objects of class multiOmics and
MultiplePLSCoxModels. The function operates iteratively, so
in our case there are seven different sets of predictions of the PLS
components. These different predictions are averaged and saved
internally as a data frame called meanPredictions. The
structure of models created and stored in the plasma object
is the same as for the separate, individual, omics models.
Figure 4 shows the Kaplan-Meier plot using the
predicted risk, split at the median value, on the training data set.
pl <- plasma(MO, firstPass)
plot(pl, legloc = "topright", main = "Training Data", xlab = "Time (Days)")
Figure 4 : Kaplan-Meier plot of overall survival on the training set
using the unified plasma Cox model.
Now we want to see how well the final composite model generalizes to our test set. Figure 5 uses the predicted risk, split at the median of the training data, to construct a Kaplan-Meier plot on the test data. The model yields a statistically significant (p = 0.0063) separation of outcomes between the high and low risk patients.
testpred <- predict(pl, testD)
plot(testpred, main="Testing Data", xlab = "Time (Days)")
Figure 5 : Kaplan-Meier plot of overall survival on the test set.
At this point, our model appears to be a fairly complex black box. We have constructed a matrix of components, based on linear combinations of actual features in different omics data sets. These components can then be combined in yet another linear model that predicts the time-to-event outcome through Cox regression. In this section, we want to explore how the individual features from different omics data sets contribute to different model components.
Our first act toward opening the black box is to realize that not all of the components discovered from the individual omics data sets survived into the final composite model. Some components were eliminated because they appeared to be nearly linearly related to components found in other omics data sets. So, we can examine the final composite model more closely.
pl@fullModel
## Call:
## coxph(formula = Surv(Days, vital_status == "dead") ~ ClinicalBin1 +
## MAF1 + Meth4501 + Meth4502 + Meth4503 + Meth4504 + mRNASeq1 +
## RPPA2 + RPPA3, data = riskDF)
##
## coef exp(coef) se(coef) z p
## ClinicalBin1 0.7458 2.1082 0.4767 1.564 0.117715
## MAF1 2.2183 9.1914 0.8789 2.524 0.011602
## Meth4501 -0.1429 0.8668 0.0712 -2.007 0.044727
## Meth4502 1.6820 5.3761 0.8465 1.987 0.046930
## Meth4503 1.1703 3.2228 0.3373 3.469 0.000522
## Meth4504 -1.8067 0.1642 0.7077 -2.553 0.010683
## mRNASeq1 0.2624 1.3001 0.1003 2.615 0.008911
## RPPA2 0.9545 2.5975 0.2887 3.306 0.000947
## RPPA3 0.8029 2.2321 0.3239 2.479 0.013169
##
## Likelihood ratio test=60.41 on 9 df, p=1.12e-09
## n= 185, number of events= 77
temp <- terms(pl@fullModel)
mainterms <- attr(temp, "term.labels")
rm(temp)
mainterms
## [1] "ClinicalBin1" "MAF1" "Meth4501" "Meth4502" "Meth4503" "Meth4504"
## [7] "mRNASeq1" "RPPA2" "RPPA3"
We see that at least one component discovered from four of the five “true” omics data sets survived in the final model; only the miR components failed to make the cut. In addition, one component from the binary clinical data was retained in the final model.
Our interest now turns to understanding how the features from
individual omics data sets contribute to the components that are used in
the final model. As mentioned earlier, these contributions are mediated
through two levels of linear regression models when extending a model
from data set A to data set B. A linear combination of features from set
B is used to define the secondary level of components; then a linear
combination of these components is used to predict the components of the
single Cox model that had been built from set A. These weights can be
combined and extracted using the getAllWeights function,
and can then be explored.
We use the binary clinical data set to begin illustrating one method for interpreting the components.
library(oompaBase)
HG <- blueyellow(64)
cbin <- getAllWeights(pl, "ClinicalBin")
compcolors <- c("forestgreen", "purple")[1 + 1*(colnames(cbin@contrib) %in% mainterms)]
heat(cbin, cexCol = 0.9, cexRow = 0.5, col = HG, ColSideColors = compcolors)
Figure 6 : Unscaled heatmap of the contributions of binary clinical features to all components (purple = retained in final model, green = not retained).
Figure 6 shows the raw weights for each clinical binary feature in all of the original omics components. We would like to simplify this plot in several ways. First, we can remove any components that were not retained in the final model (indicated by the green color bar in the top dendrogram). Second, we hypothesize that some components intrinsically have a wider spread of weights, and that it might be more important to scale the components consistently to look at the relative contributions. Finally, we can remove any features that seem to make no contributions to any of the components; that is, those that do not have highly ranked weights (by absolute value) in any component.
shrink <- function(dset, N) {
dset@contrib <- scale(dset@contrib) # standardize
feat <- unique(unlist(as.list(getTop(dset, N)))) # remove useless features
dset@contrib <- dset@contrib[feat, mainterms] # remove unused components
dset
}
xbin <- shrink(cbin, 4)
heat(xbin, cexCol = 0.9, cexRow = 0.9, col = HG)
Figure 7 : Scaled heatmap of the contributions of filtered binary clinical features to important components.
In Figure 7, we can identify strong contrasts between several pairs of variables. For example, one set of components is enriched with white, never smokers, who still have evidence of tumors, at stage T3 and grade 3 in the lower third of the esophagus (ICD-10 code C15.5), while another group is enriched for Asian, current smokers, who are tumor-free, with stage N0, T2 tumors from the lower third of the esophagus (ICD-10 code C15.4).
We can apply the same method to visualize contributors from each of the omics data sets. As a second illustration, we look at the standardized weights from the mRNA data set in the components that are part of the final model, keeping only those features that are highly ranked by absolute weight in at least one component (Figure 8).
mrna <- getAllWeights(pl, "mRNASeq")
xmrna <- shrink(mrna, 7)
tmp <- rownames(xmrna@contrib)
rownames(xmrna@contrib) <- sapply(strsplit(tmp, "\\."), function(x) x[1])
heat(xmrna, cexCol = 0.9, cexRow = 0.6, col = HG)
Figure 8 : Scaled heatmap of the contributions of filtered mRNA features to important components
One difficulty with the heatmaps in the previous section is that they are focused on individual input data sets, and not on individual components. In order to fully understand which features contribute, for example, to the first mutation component (MAF1), one would have to scan all the heatmaps from all the datasets and then try to combine the influences. In order to help with that procedure, we can merge all the contributions into a single data frame, with an accompanying factor tracking the source data set.
CW <- combineAllWeights(pl)
contra <- CW@combined
datasrc <- CW@dataSource
Figure 9 displays the mean, standard deviation (SD), median, and median absolute deviation (MAD) of the weight-contributions for each data set in each component.
image(CW)
Figure 9 : Summary statistics of weights by component and dataset.
To remain consistent with the previous heatmaps, we have standardized the weights in each data set and component. In Figure 10, we create beanplots showing the distributions of weights arising from each data set in each (retained) component. (Similar plots for the raw weights are available in Supplementary Data.) Some data sets have very different contribution patterns than others. For example, the miRSeq data set appears to have significant outliers making large contributions in almost every component; the MAF and RPPA data sets also frequently (but not always) include such outliers.
library(beanplot)
library(Polychrome)
data(palette36)
foo <- computeDistances(palette36[3:36])
colist <- as.list(palette36[names(foo)[1:7]])
brute <- stdize(CW, "standard")
opar <- par(mfrow = c(3,3), mai = c(0.2,0.2, 0.5, 0.2))
for (i in which(colnames(contra) %in% mainterms)) {
beanplot(brute[, i] ~ datasrc, what = c(1,1,1,0), col = colist,
main = paste("Std Wts, Component", i))
}
Figure 10 : Distributions of standardized weights by data set and component.
par(opar)
rm(opar)
We also include plots of the histograms of distributions by component (Figure 11). None of these really looks quite normal; almost all have some slightly odd shape.
opar <- par(mfrow = c(3, 3), mai = c(0.2,0.2, 0.5, 0.2))
for (i in which(colnames(contra) %in% mainterms)) {
hist(brute[,i], breaks = 77, main = paste("Component", i))
}
Figure 11 : Histogram of standardized weights by component.
par(opar)
rm(opar)
Next, we want to see how many items in the lists of “top twenty” contributors to each component come from each data set. The results are shown in Figure 12. Using the raw weights, the vast majority of contributions come from the clinical binary data, with secondary contributions from the MAF and RPPA data sets (as we expected from the above distribution plots). After standardization, most of the contributions arise from miRs, but the methylation, mRNA and, to a lesser extent, the MAF and RPPA data sets are also present.
top20 <- apply(contra, 2, function(X) {
A <- abs(X)
S <- rev(sort(A))
which(A > S[21])
})
top20types <- apply(top20, 2, function(X) {
table(datasrc[X])
})
chop20 <- apply(brute, 2, function(X) {
A <- abs(X)
S <- rev(sort(A))
which(A > S[21])
})
chop20types <- apply(chop20, 2, function(X) {
table(datasrc[X])
})
opar <- par(mfrow = c(1, 2))
image(1:7, 1:24, top20types, main = "Raw Weights", ylab = "Components", xlab = "Data Sets")
mtext(levels(datasrc), side = 3, at = 1:7, line = 0)
image(1:7, 1:24, chop20types, main = "Standardized Weights", ylab = "Components", xlab = "Data Sets")
mtext(levels(datasrc), side = 3, at = 1:7, line = 0)
Figure 12 : Number of contributors to top twenty lists.
par(opar)
rm(opar)
In Figure 13, we pool all the weights (across all data sets and all components) to look at histograms of the distributions. We also overlay the theoretical normal distribution that one would expect to see. Comparing the standardized weights to the standard normal distribution (right panel), the actual weights are slightly more conservative (i.e., concentrated near zero) than expected. This figure suggests that we might want to use standardization, and decide on significance purely from the theoretical normal distribution rather than from deviations away from that distribution.
opar <- par(mfrow = c(1, 2))
hist(contra, breaks = 123, main = "Raw Weights", prob = TRUE)
xx <- seq(-10, 10, length=1001)
yy <- dnorm(xx, mean(contra), sd(contra))
lines(xx, yy, col = "red", lwd=2)
hist(brute, breaks = 123, main = "Standardized Weights", prob = TRUE)
yy <- dnorm(xx)
lines(xx, yy, col = "red", lwd=2)
Figure 13 : Histograms of all weights (combined).
par(opar)
rm(opar)
Finally, we create yet another image (Figure 14), counting the number of significant features from each data set for each component, when using a significance cutoff of 5% derived from the standard normal distribution. We feel that this result is more reasonable than anything we got just by looking at the top twenty lists. Most contributions come from the biggest omics data sets (mRNA, methylation, and miR), with fewer from MAF, RPPA, and clinical binary.
Q <- qnorm(0.975) # two-sided 5% cutoff
top1p <- aggregate(brute, list(datasrc), function(X) sum(abs(X) > Q)) # top 5 percent
rownames(top1p) <- top1p[, 1]
top1p <- as.matrix(top1p[, -1])
top1p <- top1p[, mainterms]
image(1:7, 1:9, top1p, ylab = "Components", xlab = "Data Sets")
mtext(levels(datasrc),side = 3, at = 1:7, line = 0)
Figure 14 : Number of significant contributions by data set and component.
In the final Cox proportional hazards model of overall survival, the
component with the largest hazard ratio was “MAF1” (the
first feature discovered from the MAF mutation data set), for which each
one unit increase in the component corresponds to a 9-fold increase in
the hazard. Our next goal is to find a biological interpretation of this
component. First, here is an overview of all the contributing
features.
pickA <- interpret(CW, "MAF1", 0.05)
summary(pickA)
## Feature Source Weight
## Length:286 ClinicalBin : 3 Min. :-5.058
## Class :character ClinicalCont: 1 1st Qu.:-2.599
## Mode :character MAF : 6 Median :-2.178
## Meth450 : 83 Mean :-1.216
## miRSeq : 46 3rd Qu.: 1.961
## mRNASeq :131 Max. : 7.607
## RPPA : 16
We start by looking more closely at the clinical features.
pickA[1:4, ]
## Feature Source Weight
## gender.female gender.female ClinicalBin -2.153483
## gender.male gender.male ClinicalBin 2.113973
## pathology_T_stage.t2 pathology_T_stage.t2 ClinicalBin -2.170436
## weight weight ClinicalCont 2.010542
Because of our decision to use one-hot encoding of categorical
variables, our data set includes separate features for “male”
and “female”. Both terms are strongly related to the MAF1 component,
with (as one would hope) approximately equal standardized weights but
opposite signs. Being female decreases the hazard; being male increases
it. Since the coefficient of MAF1 in the final model of
overall survival is itself positive, we can infer the direction that the
hazard changes. It is harder to “eyeball” the magnitude of the change in
the hazard, since these coefficients only measure the relative
contribution of these factors to the MAF1 component.
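The one-hot encoding itself can be reproduced in a couple of lines of base R (the toy gender vector below is ours, not taken from the TCGA data):

```r
# Expand a categorical variable into one binary column per level;
# dropping the intercept (the "- 1") keeps all levels, which is why
# both "male" and "female" appear as separate features.
gender <- factor(c("male", "female", "female", "male"))
onehot <- model.matrix(~ gender - 1)
colnames(onehot)   # "genderfemale" "gendermale"
```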
Many of the feature names obtained from TCGA include extra annotations that we won’t use later, so we are going to simplify them.
The following code extracts all of the gene names from the Illumina annotations, not just the first element of each list.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tidyr)
illumina<- read_csv("https://webdata.illumina.com/downloads/productfiles/humanmethylation450/humanmethylation450_15017482_v1-2.csv", skip=7)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 486428 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): IlmnID, Name, AddressA_ID, AlleleA_ProbeSeq, AlleleB_ProbeSeq, Infinium_Design_Typ...
## dbl (4): AddressB_ID, Genome_Build, MAPINFO, Coordinate_36
## lgl (4): Random_Loci, Methyl27_Loci, Enhancer, DHS
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
illuminaselect <- illumina %>% select( c("IlmnID", "UCSC_RefGene_Name"))
illuminaselect2<- separate_rows(illuminaselect, UCSC_RefGene_Name, sep=";")
illuminaselect2 <- as.data.frame(illuminaselect2)
illuminaselect3<- illuminaselect2 [which(illuminaselect2$IlmnID %in% pickA$Feature),]
illuminaselect4<- illuminaselect3[match(pickA$Feature[ pickA$Source=="Meth450"], illuminaselect3$IlmnID),]
illuminaselect4$UCSC_RefGene_Name[which(is.na(illuminaselect4$UCSC_RefGene_Name))] <- ""
rap <- pickA$Feature[pickA$Source == "RPPA"]
rap <- sapply(strsplit(rap, "\\."), function(x) x[1])
map <- pickA$Feature[pickA$Source == "MAF"]
nap <- pickA$Feature[pickA$Source == "mRNASeq"]
nap <- sapply(strsplit(nap, "\\."), function(x) x[1])
nap <- nap[-1]
nick <- rep("", nrow(pickA))
nick[pickA$Source == "RPPA"] <- rap
nick[pickA$Source == "MAF"] <- map
nick[pickA$Source == "mRNASeq"] <- c("", nap)
nick[pickA$Source=="Meth450"]<- illuminaselect4$UCSC_RefGene_Name
pickA$Nickname <- nick
#write.csv(pickA, file = "pickMAF1.csv")
Here are the mutated genes (from the MAF data set) associated with
the component “MAF1”.
pickA[nick %in% map, ]
## Feature Source Weight Nickname
## CFAP54 CFAP54 MAF -2.045903 CFAP54
## DNAH9 DNAH9 MAF -2.061517 DNAH9
## DYNC2H1 DYNC2H1 MAF -1.994866 DYNC2H1
## FBN3 FBN3 MAF -2.029872 FBN3
## MYH13 MYH13 MAF -2.008811 MYH13
## PCDHA12 PCDHA12 MAF -1.995470 PCDHA12
Note that they all have negative coefficients, meaning that having
these mutations decreases the effect of this component. Since the
coefficient of MAF1 in the final model of overall survival
is itself positive, that means that having any of these mutations
decreases the hazard for that patient. Here are the Entrez Gene
descriptions of the genes:
These descriptions clearly indicate some common functional relationships between the mutated genes, including a role in cilia, flagella, and microfibrils.
We then extracted all gene names from the MAF, mRNASeq, and RPPA data sets and used them to perform a gene enrichment (pathway) analysis using ToppGene [@chen2009]. Associated GeneOntology Biological Process categories included keratinization, epidermal and epithelial cell development, cell adhesion, intermediate filament organization, and wound healing. GeneOntology Cellular Components included cell-cell junctions, extracellular matrix, and intermediate filaments. Associated human phenotypes included hyperkeratosis (particularly follicular hyperkeratosis), epidermal thickening, and oral leukoplakia. Associated pathways included keratinization, gap junction assembly and trafficking, and both ErbB and mTOR signaling. Associated cytogenetic regions included 1q21-1q22, 18q12.1, and 12q12.13. Associated gene families included cadherins, kallikreins, keratins, and gap-junction proteins. Associated diseases included hyperkeratosis, squamous cell carcinoma of the head and neck, intraepithelial neoplasia, endometrial carcinoma, basal-like breast carcinoma, and esophageal carcinoma.
Next, we look at the components that were picked out in the final model, in addition to MAF1.
print(mainterms)
## [1] "ClinicalBin1" "MAF1" "Meth4501" "Meth4502" "Meth4503" "Meth4504"
## [7] "mRNASeq1" "RPPA2" "RPPA3"
mylist<- list()
for (i in 1:length(mainterms)){
pickA <- interpret(CW, mainterms[i], 0.05)
rap <- pickA$Feature[pickA$Source == "RPPA"]
rap <- sapply(strsplit(rap, "\\."), function(x) x[1])
map <- pickA$Feature[pickA$Source == "MAF"]
nap <- pickA$Feature[pickA$Source == "mRNASeq"]
nap <- sapply(strsplit(nap, "\\."), function(x) x[1])
nap <- nap[-1]
nick <- rep("", nrow(pickA))
nick[pickA$Source == "RPPA"] <- rap
nick[pickA$Source == "MAF"] <- map
nick[pickA$Source == "mRNASeq"] <- c("", nap)
illuminaselect <- illumina %>% select( c("IlmnID", "UCSC_RefGene_Name"))
illuminaselect2<- separate_rows(illuminaselect, UCSC_RefGene_Name, sep=";")
illuminaselect2 <- as.data.frame(illuminaselect2)
illuminaselect3<- illuminaselect2 [which(illuminaselect2$IlmnID %in% pickA$Feature),]
illuminaselect4<- illuminaselect3[match(pickA$Feature[ pickA$Source=="Meth450"], illuminaselect3$IlmnID),]
illuminaselect4$UCSC_RefGene_Name[which(is.na(illuminaselect4$UCSC_RefGene_Name))] <- ""
nick[pickA$Source=="Meth450"]<- illuminaselect4$UCSC_RefGene_Name
pickA$Nickname <- nick
mylist[[i]] <- pickA
}
names(mylist) <- mainterms
for (i in 1:length(mylist)){
write.table(mylist[[i]]$Nickname, file=file.path("ToppGeneStuff", paste0("pick_", names(mylist)[i], ".txt")), quote = FALSE,
col.names =FALSE, row.names = FALSE)
}
Read in the .txt files containing results from ToppGene. Some interesting features to subset on are GO: Biological Process, Pathway, Coexpression, Coexpression Atlas, ToppCell Atlas, and Disease. These seem to be relevant to our interests and have a decent number of “Genes from Input”. Take the top enrichments in each category (five per category, based on lowest p-values) and create a list of data frames.
readinthese <- file.path("ToppGeneStuff", paste0("toppgene_", mainterms))
interested<- c("GO: Biological Process", "Pathway", "Coexpression", "Coexpression Atlas", "ToppCell Atlas", "Disease")
cutoffN <- 5
toppgeneres <- list()
for (i in 1:length(readinthese)){
temp <- read.table(paste0(readinthese[i],".txt"), sep="\t", header=TRUE, fill=TRUE, quote="")
temp2 <- temp[which(temp$Category %in% interested),]
yeehaw <- list()
for (k in 1:length(interested)){
temp3 <-temp2[temp2$Category==interested[k],]
temp4 <- head(temp3[order(temp3$p.value, decreasing = FALSE),], cutoffN)
yeehaw[[k]] <- temp4
yeet <- do.call(rbind, yeehaw)
}
toppgeneres[[i]] <- yeet
}
names(toppgeneres) <- mainterms
d<- do.call(rbind, toppgeneres) %>%
mutate( ComponentName =sapply(strsplit(rownames(.), "\\."), `[[`, 1) ) %>%
select(ComponentName, everything()) %>%
mutate(neglogpval = - log10(p.value))
library(ggplot2)
pdf(file=file.path("ToppGeneStuff", "toppgene_bubble.pdf"),width=20, height=20)
d %>% ggplot(aes(x=Name, y=ComponentName, size=as.numeric(Hit.Count.in.Query.List)))+
geom_point(aes(color=neglogpval))+
scale_colour_gradient(low = "pink", high="darkred")+
scale_size(range=c(1,10), name="Hit Count in Query")+
scale_x_discrete(guide=guide_axis(angle=90))+
theme(panel.grid.major=element_line(colour="black"),
text=element_text(size=10))
dev.off()
## png
## 2
We then cluster the components to produce a neater visualization:
uhh <- d[,c("ComponentName", "Name", "neglogpval")]
uhh[uhh==""]<-NA
uhh2 <- uhh %>% pivot_wider(values_from = neglogpval, names_from = Name, values_fn = mean)
uhh2[is.na(uhh2)]<-0
h <- hclust(d=dist(uhh2))
## Warning in dist(uhh2): NAs introduced by coercion
ComponentNameNew<- uhh2$ComponentName[ order.dendrogram(as.dendrogram(h)) ]
myL <-list()
for (i in 1:length(ComponentNameNew)){
d_sub <- d[ComponentNameNew[i] == d$ComponentName,]
myL[[i]] <- d_sub
}
d_new <- do.call(rbind, myL)
pdf(file=file.path("ToppGeneStuff", "toppgene_bubble_clustered.pdf"),width=20, height=22)
d_new %>% ggplot(aes(x=Name, y=ComponentName, size=as.numeric(Hit.Count.in.Query.List)))+
geom_point(aes(color=neglogpval))+
scale_colour_gradient(low = "pink", high="darkred")+
scale_size(range=c(1,10), name="Hit Count in Query")+
scale_x_discrete(guide=guide_axis(angle=90))+
scale_y_discrete(limits=d_new$ComponentName)+
theme(panel.grid.major=element_line(colour="black"),
text=element_text(size=10))
dev.off()
## png
## 2
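As an aside, the leaf order used for this reordering is also stored directly on the hclust object: `h$order` should match `order.dendrogram(as.dendrogram(h))`. A quick check on random toy data:

```r
# Illustrative: the dendrogram leaf order equals the hclust 'order' slot.
set.seed(1)
m <- matrix(rnorm(20), nrow = 5)
h <- hclust(dist(m))
identical(h$order, order.dendrogram(as.dendrogram(h)))
```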
We see that components can be reasonably grouped together based on the top 5 items from the ontology categories we have chosen from ToppFun (GO: Biological Process, Pathway, Coexpression, Coexpression Atlas, ToppCell Atlas, and Disease). In the largest group, we see that Meth4504, Meth4502, ClinicalBin1, mRNASeq1, Meth4503, and MAF1 are quite similar to each other. This group appears to be enriched for the term “Squamous cell”. The only terms separating mRNASeq1, Meth4503, and MAF1 are two myeloid-monocyte cell signatures and “Squamous cell carcinoma of the head and neck”. RPPA2 and RPPA3 are similar to each other and distinct from the group discussed previously in terms of enriched ontologies. Meth4501 also appears to be unique, and is notably the only component containing the term “adenocarcinoma” in its ToppFun result.
We have identified a method analogous to that of MOFA
that allows us to combine different omics data sets without the need for
prior imputation of missing values. A major difference is that while
the MOFA model learns “factors” that are composites of the
variables in an unsupervised fashion, the plasma model
learns “components” that are composites of the variables in a
supervised fashion, using the outcomes “event” and “time-to-event” as
response variables.
Although the factors from MOFA are defined such that the
first factor, Factor 1, accounts for the greatest variance in the
model, the factors may or may not be significantly associated with the
outcome, and a post-hoc survival analysis must be performed to
assess this. Conversely, some factors that are
significantly associated with outcome may account for very little variance
in the final MOFA model, which hinders interpretability.
This was the case with the TCGA-ESCA dataset: when 10 factors
were learned from the MOFA model, only Factor 10 was
significantly associated with survival, while accounting for [number]
variance in the model [CITE SUPPLEMENTARY RESULTS?]. On the other hand,
the components for plasma are constructed in a way that
maximizes the covariance between the predictors and the response,
so these components are automatically associated, to some
degree, with the outcome. This is advantageous in that dissecting
the weights associated with the components yields a list of
variables from different omics data sets that contribute the most to
predicting the outcome, and any additional analyses can be refined by
examining these high-weight variables more closely.
suppressWarnings( library(MOFA2) )
##
## Attaching package: 'MOFA2'
## The following object is masked from 'package:plasma':
##
## predict
## The following object is masked from 'package:stats':
##
## predict
library(ggplot2)
library(dplyr)
library(tidyr)
In the final MOFA model, we would prefer that no single variable explains the most variance in the model simply because it encompasses a wide range of values. For example, the untransformed “days_to_birth” column in “ClinicalCont” has a particularly wide range.
range(assemble$ClinicalCont["days_to_birth",])
## [1] -32872 -10222
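Base R’s scale() standardizes each column of a matrix, by default to mean 0 and standard deviation 1. A quick illustration on a toy matrix:

```r
# Illustrative: scale() centers and scales column-wise.
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)
s <- scale(m)
round(colMeans(s), 10)   # 0 0
apply(s, 2, sd)          # 1 1
```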
We will create a vector of names of datasets that are continuous (not binary) and scale them around 0 before running MOFA.
# scale the columns that are not binary
notbinarydata <- c("ClinicalCont", "Meth450", "miRSeq", "mRNASeq", "RPPA")
UseMeScaled <- lapply(MO@data[notbinarydata], scale)
UseMeScaled$ClinicalBin <- MO@data$ClinicalBin
UseMeScaled$MAF <- MO@data$MAF
oof <- create_mofa(UseMeScaled)
## Creating MOFA object from a list of matrices (features as rows, sample as columns)...
oof
## Untrained MOFA model with the following characteristics:
## Number of views: 7
## Views names: ClinicalCont Meth450 miRSeq mRNASeq RPPA ClinicalBin MAF
## Number of features (per view): 6 1454 926 2520 192 53 566
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 185
##
plot_data_overview(oof)
First, we retrieve the default options and alter them if necessary. There are three sets, called “data options”, “model options”, and “training options”.
dataoptions <- get_default_data_options(oof)
dataoptions
## $scale_views
## [1] FALSE
##
## $scale_groups
## [1] FALSE
##
## $center_groups
## [1] TRUE
##
## $use_float32
## [1] FALSE
##
## $views
## [1] "ClinicalCont" "Meth450" "miRSeq" "mRNASeq" "RPPA" "ClinicalBin"
## [7] "MAF"
##
## $groups
## [1] "group1"
modeloptions <- MOFA2::get_default_model_options(oof)
modeloptions$num_factors <-10
modeloptions
## $likelihoods
## ClinicalCont Meth450 miRSeq mRNASeq RPPA ClinicalBin MAF
## "gaussian" "gaussian" "gaussian" "gaussian" "gaussian" "gaussian" "gaussian"
##
## $num_factors
## [1] 10
##
## $spikeslab_factors
## [1] FALSE
##
## $spikeslab_weights
## [1] TRUE
##
## $ard_factors
## [1] FALSE
##
## $ard_weights
## [1] TRUE
modeloptions[["likelihoods"]][["ClinicalBin"]] <- "bernoulli"
modeloptions[["likelihoods"]][["MAF"]] <- "bernoulli"
modeloptions
## $likelihoods
## ClinicalCont Meth450 miRSeq mRNASeq RPPA ClinicalBin MAF
## "gaussian" "gaussian" "gaussian" "gaussian" "gaussian" "bernoulli" "bernoulli"
##
## $num_factors
## [1] 10
##
## $spikeslab_factors
## [1] FALSE
##
## $spikeslab_weights
## [1] TRUE
##
## $ard_factors
## [1] FALSE
##
## $ard_weights
## [1] TRUE
trainingoptions <- get_default_training_options(oof)
trainingoptions
## $maxiter
## [1] 1000
##
## $convergence_mode
## [1] "fast"
##
## $drop_factor_threshold
## [1] -1
##
## $verbose
## [1] FALSE
##
## $startELBO
## [1] 1
##
## $freqELBO
## [1] 5
##
## $stochastic
## [1] FALSE
##
## $gpu_mode
## [1] FALSE
##
## $seed
## [1] 42
##
## $outfile
## NULL
##
## $weight_views
## [1] FALSE
##
## $save_interrupted
## [1] FALSE
oof <- prepare_mofa(oof, data_options = dataoptions, model_options = modeloptions, training_options = trainingoptions)
## Warning in prepare_mofa(oof, data_options = dataoptions, model_options = modeloptions, : Some
## view(s) have less than 15 features, MOFA will have little power to to learn meaningful factors
## for these view(s)....
## Checking data options...
## Checking training options...
## Checking model options...
# run MOFA if the file does not exist
if (file.exists("mofaESCAScaled.Rda")){
load("mofaESCAScaled.Rda")
} else{
starttime<-Sys.time()
mofaESCAScaled <- run_mofa(oof, use_basilisk = TRUE)
save(mofaESCAScaled, file = "mofaESCAScaled.Rda")
endtime<-Sys.time()
endtime-starttime
}
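The caching pattern above (compute once, save, then reload on later runs) can also be written with saveRDS/readRDS, which stores a single object without tying it to the original variable name. A minimal sketch, with `slow_fit` as a stand-in for the expensive run_mofa() call:

```r
# Illustrative caching sketch; 'slow_fit' is a hypothetical stand-in
# for an expensive computation such as run_mofa().
cache <- file.path(tempdir(), "fit_cache.rds")
slow_fit <- function() sum(1:10)
fit <- if (file.exists(cache)) {
  readRDS(cache)
} else {
  r <- slow_fit()
  saveRDS(r, cache)
  r
}
```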
# How do the factors correlate to each other?
plot_factor_cor(mofaESCAScaled)
# Variance explained:
plot_variance_explained(mofaESCAScaled, plot_total = T)[[2]]
plot_variance_explained(mofaESCAScaled)
size <- sapply(UseMeScaled, dim)
type <- sapply(UseMeScaled, function(M) {
U <- unique(as.vector(as.matrix(M)))
ifelse(length(U) > 3, "gaussian", "bernoulli")
})
heatme <- function() {
Y <- lapply(names(UseMeScaled), function(V) {
sz <- size[1, V] < 200
MOFA2::plot_weights_heatmap(mofaESCAScaled, view = V, main = V, show_colnames = sz, silent = TRUE)
})
Z <- lapply(Y, function(x) x$gtable)
W <- gridExtra::arrangeGrob(Z[[1]], Z[[2]], Z[[3]], Z[[4]],
Z[[5]], Z[[6]], Z[[7]], nrow = 3)
W
}
X <- heatme()
plot(X)
factorTopWeights <- function(J, mymofainput, mymofaoutput) {
Z <- lapply(names(mymofainput), function(V) {
plot_top_weights(mymofaoutput, view = V, factor = J) + ggplot2::ggtitle(V)
})
W <- gridExtra::grid.arrange(Z[[1]], Z[[2]], Z[[3]], Z[[4]],
Z[[5]], Z[[6]], Z[[7]],
nrow = 3)
}
factorTopWeights(J=1, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=2, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=3, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=4, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=5, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=6, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=7, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=8, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=9, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
factorTopWeights(J=10, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)
LF <- get_factors(mofaESCAScaled, groups = "all", factors = "all") [[1]]
all(rownames(MO@outcome) == rownames(LF))
## [1] TRUE
vitalstatus2 <- ifelse(MO@outcome$vital_status=="alive", 0,
ifelse(MO@outcome$vital_status=="dead", 1, NA))
LFwithOutcome<- data.frame(LF, event = vitalstatus2, time_to_event = MO@outcome$Days)
LF2 <- as.data.frame(t(LF))
OSresults <- matrix(NA, nrow=nrow(LF2), ncol = 4)
rownames(OSresults) <- rownames(LF2)
colnames(OSresults) <- c("coef", "HR", "score", "pvalue")
for (J in 1:nrow(LF2)) {
tempdf <- data.frame(Time = LFwithOutcome$time_to_event,
Event = LFwithOutcome$event,
X = unlist(LF2[J, ]))
model <- coxph(Surv(Time, Event) ~ X, data = tempdf)
S <- summary(model)
val <- c(S$coefficients[1:2], S$sctest[c(1, 3)])
OSresults[J, ] <- val
}
OSresults
## coef HR score pvalue
## Factor1 0.04034621 1.0411712 0.36946415 0.54329650
## Factor2 0.07980337 1.0830741 0.81726352 0.36598147
## Factor3 -0.03396494 0.9666054 0.06875337 0.79316059
## Factor4 -0.06938098 0.9329712 0.46431681 0.49561339
## Factor5 0.03659047 1.0372681 0.08116115 0.77572998
## Factor6 0.07235519 1.0750371 0.29335907 0.58807596
## Factor7 -0.11561998 0.8908137 0.67336613 0.41188048
## Factor8 0.13690316 1.1467171 1.37975043 0.24014366
## Factor9 0.12882066 1.1374861 0.50075710 0.47916765
## Factor10 -0.30654878 0.7359826 4.96757135 0.02582689
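For reference, the per-factor score test reported above comes straight out of summary.coxph. A self-contained illustration on the survival package’s built-in lung data:

```r
library(survival)
# Illustrative: summary(coxph(...))$sctest holds the score (log-rank)
# test statistic, degrees of freedom, and p-value, in that order.
fit <- coxph(Surv(time, status) ~ age, data = lung)
s <- summary(fit)
s$sctest[c(1, 3)]  # statistic and p-value, as extracted in the loop above
```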
osmodel <- survival::coxph(Surv(time_to_event, event) ~ ., data = LFwithOutcome)
AIC <- stats::step(osmodel)
summary(osmodel)
summary(AIC)
Factor 10 is the only factor significantly associated with survival.
s <- summary(osmodel)
coef <- s[["coefficients"]]
df <- data.frame(
factor = factor(rownames(coef), levels = rev(rownames(coef))),
p = coef[,"Pr(>|z|)"],
coef = coef[,"exp(coef)"],
lower = s[["conf.int"]][,"lower .95"],
higher = s[["conf.int"]][,"upper .95"]
)
ggplot(df, aes(x=factor, y=coef, ymin=lower, ymax=higher)) +
geom_pointrange( col='#619CFF') +
coord_flip() +
scale_x_discrete() +
labs(y="Hazard Ratio", x="") +
geom_hline(aes(yintercept=1), linetype="dotted") +
theme_bw()
The only factor significantly associated with survival outcome is Factor 10, which is marginal in terms of the variance explained in the model:
library(tidyr)
v10<- plot_variance_explained(mofaESCAScaled)
v10wide<-pivot_wider(v10$data, names_from="factor", values_from="value")
v10wide<-v10wide %>% select(-c("group"))
v10wide<-as.data.frame(v10wide)
rownames(v10wide) <- v10wide[,1]
v10wide<-v10wide[,-1]
v10wide_t<-t(v10wide)
sort(rowSums(v10wide_t), decreasing = TRUE)
## Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
## 107.469140 14.494450 12.729408 12.519620 9.193192 6.284764 6.100425 5.899442
## Factor9 Factor10
## 5.712097 4.136799
The variances do not total 100, so they are not quite what we normally think of as \(R^2\). It may be easier to interpret if we force them into percentages like this:
barplot(sort( rowSums(v10wide_t)/sum(rowSums(v10wide_t)) , decreasing = TRUE ), las=2,
main="Relative R^2")
sort( rowSums(v10wide_t)/sum(rowSums(v10wide_t)) , decreasing = TRUE )
## Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
## 0.58236440 0.07854396 0.06897937 0.06784255 0.04981698 0.03405650 0.03305759 0.03196848
## Factor9 Factor10
## 0.03095328 0.02241689
This highlights a crucial limitation of MOFA: factors that explain a large share of the variance in the model (usually the first few) may not be significantly related to survival outcome in the post-hoc analysis.
Because these factors were learned to explain the most variance within the model, without any supervision, there is some “luck” involved in finding factors that both contribute to defining the model and are significantly related to outcome.
Plasma avoids this dissociation between factors and outcome, since plasma is a supervised analysis that purposefully learns components that explain the outcome.